In this paper, we focus on analyzing and improving the dropout technique for the self-attention layers of Vision Transformers, which is important yet surprisingly ignored by prior works. In particular, we study three core questions: First, what to drop in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation forward, ahead of the attention matrix calculation, and to set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme helps keep both the regularization and the probability features of attention weights, alleviating overfitting to specific patterns and enhancing the model's ability to capture vital information. Second, how to schedule the drop ratio across consecutive layers? In contrast to using a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally verify that the proposed schedule avoids overfitting on low-level features and missing high-level semantics, thus improving the robustness and stability of model training. Third, is it necessary to perform a structured dropout operation as in CNNs? We try a patch-based, block version of the dropout operation and find that this trick, useful for CNNs, is not essential for ViTs. Based on the exploration of the above three questions, we present the novel DropKey method, which regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, \emph{e.g.,} T2T and VOLO, and various vision tasks, \emph{e.g.,} image classification, object detection, human-object interaction detection, and human body shape recovery. The code will be released upon acceptance.
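A minimal sketch of the drop-Key-before-softmax idea described above, in PyTorch: dropout is applied to the attention logits (with Keys as the drop units) before the softmax, and the drop ratio decreases with layer depth. Tensor shapes, the masking constant, and the linear schedule are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def dropkey_attention(q, k, v, drop_ratio=0.1, training=True):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, H, Lq, Lk)
    if training and drop_ratio > 0:
        # Randomly mask Key positions in the logit matrix before the softmax,
        # so the surviving attention weights are renormalized and still form
        # a valid probability distribution over the remaining Keys.
        mask = torch.rand_like(scores) < drop_ratio
        scores = scores.masked_fill(mask, -1e9)
    attn = F.softmax(scores, dim=-1)
    return attn @ v

def decreasing_schedule(layer_idx, num_layers, base_ratio=0.3):
    # Linearly decreasing drop ratio along the stack of self-attention layers.
    return base_ratio * (1.0 - layer_idx / max(num_layers - 1, 1))
```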
We present a task that measures how people generalize objects' causal powers based on observing a single (Experiment 1) or a few (Experiment 2) causal interactions between object pairs. We propose a computational modeling framework that can synthesize human-like generalization patterns in our task setting and that sheds light on how people may efficiently navigate the compositional space of possible causal functions and categories. Our modeling framework combines a causal function generator, which makes use of the features and relations of agent and recipient objects, with a Bayesian non-parametric inference process that governs the degree of similarity-based generalization. Our model has a natural "resource-rational" variant that outperforms a naive Bayesian account in describing participants, in particular reproducing the order effect and the causal asymmetry observed in our behavioral experiments. We argue that this modeling framework provides a computationally plausible mechanism for real-world causal generalization.
Large pre-trained Transformer-based language models such as BERT have drastically changed the field of natural language processing (NLP). We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for training augmentation or other purposes. We conclude with discussions on limitations and suggested directions for future research.
Audio data augmentation is a key step in training deep neural networks for audio classification tasks. In this paper, we introduce Audiogmenter, a novel audio data augmentation library in MATLAB. We provide 15 different augmentation algorithms for raw audio data and 8 for spectrograms. We efficiently implemented several augmentation techniques whose usefulness has been widely demonstrated in the literature. To the best of our knowledge, this is the largest freely available MATLAB audio data augmentation library. We validated the efficiency of our algorithms by evaluating them on the ESC-50 dataset. The toolbox and its documentation can be downloaded at https://github.com/lorisnanni/audiogmenter.
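Audiogmenter itself is a MATLAB toolbox, so the snippet below is only a minimal NumPy illustration of the kind of raw-audio augmentations such a library collects (noise injection, random gain, circular time shift); the function names and parameter ranges are illustrative assumptions, not the toolbox's API.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db=20.0):
    # Add white noise at a target signal-to-noise ratio (in dB).
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return signal + np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def random_gain(signal, low_db=-6.0, high_db=6.0):
    # Scale the waveform by a random gain drawn uniformly in dB.
    gain_db = np.random.uniform(low_db, high_db)
    return signal * (10 ** (gain_db / 20))

def time_shift(signal, max_fraction=0.1):
    # Circularly shift the waveform by up to max_fraction of its length.
    max_shift = int(len(signal) * max_fraction)
    return np.roll(signal, np.random.randint(-max_shift, max_shift + 1))
```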
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
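A minimal sketch of the token-level fusion idea described above, in PyTorch: image tokens and point-cloud tokens are concatenated and attended to by a set of object queries in a standard transformer decoder, which then regresses box parameters. The embedding dimension, query count, and box parameterization are illustrative assumptions, not the CMT architecture itself.

```python
import torch
import torch.nn as nn

class TokenFusionDetector(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Hypothetical box head: center (3) + size (3) + yaw sin/cos (2) + velocity (2).
        self.box_head = nn.Linear(dim, 10)

    def forward(self, image_tokens, point_tokens):
        # image_tokens: (B, L_img, dim); point_tokens: (B, L_pts, dim)
        memory = torch.cat([image_tokens, point_tokens], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(q, memory)
        return self.box_head(out)  # (B, num_queries, 10)
```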
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
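A minimal sketch of the NAIVEATTACK idea described above, in PyTorch: a small trigger patch is stamped onto a fraction of the raw images before distillation begins, so that the backdoor is carried into the synthetic set. The patch shape, its position, and the poisoning rate are illustrative assumptions, not the paper's exact settings.

```python
import torch

def stamp_trigger(images, labels, target_label, poison_rate=0.1, patch_size=3):
    # images: (N, C, H, W) float tensor in [0, 1]; labels: (N,) long tensor
    images, labels = images.clone(), labels.clone()
    n_poison = int(poison_rate * images.size(0))
    idx = torch.randperm(images.size(0))[:n_poison]
    # White square trigger in the bottom-right corner; poisoned samples are
    # relabeled with the attacker's target class.
    images[idx, :, -patch_size:, -patch_size:] = 1.0
    labels[idx] = target_label
    return images, labels
```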
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while a model that is not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, and even more so for drum grooves, which have little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are two-fold: First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, from two aspects, i.e., the feature level and the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and model will be available.
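A minimal sketch of the feature-level re-weighting idea described above, in PyTorch: support masks average-pool support features into class centers, which then gate the query feature map channel-wise. The shapes, the aggregation over supports, and the sigmoid gating are illustrative assumptions, not the paper's exact module.

```python
import torch

def mask_pooled_centers(support_feats, support_masks):
    # support_feats: (K, C, H, W); support_masks: (K, 1, H, W) binary masks
    masked = support_feats * support_masks
    area = support_masks.sum(dim=(2, 3)).clamp(min=1e-6)
    return masked.sum(dim=(2, 3)) / area  # (K, C), one center per support image

def reweight_query(query_feats, centers):
    # query_feats: (B, C, H, W); centers: (K, C)
    center = centers.mean(dim=0)                        # aggregate class center (C,)
    weights = torch.sigmoid(center).view(1, -1, 1, 1)   # channel-wise gate
    return query_feats * weights
```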
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
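A minimal sketch of the vanilla knowledge-distillation objective that the abstract builds on, in PyTorch: the student is trained to match the teacher's softened predictions in addition to the ground-truth labels. RELIANT's fairness regularization is not shown here; the temperature and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-label term: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-label term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```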
This paper focuses on designing efficient models with low parameter counts and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though it shares the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models while trading off model accuracy and efficiency well.
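A minimal sketch of the iRMB idea described above, in PyTorch: an inverted-residual block whose expanded features pass through both a depthwise convolution (short-range modeling) and multi-head self-attention over flattened tokens (long-range modeling). The expansion ratio, head count, and exact ordering of the two paths are illustrative assumptions, not the paper's precise block.

```python
import torch
import torch.nn as nn

class IRMBSketch(nn.Module):
    def __init__(self, dim=64, expand=4, heads=4):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, dim, H, W)
        b, _, h, w = x.shape
        y = self.act(self.expand(x))
        y = self.act(self.dwconv(y))                    # local, CNN-like path
        tokens = y.flatten(2).transpose(1, 2)           # (B, H*W, hidden)
        tokens, _ = self.attn(tokens, tokens, tokens)   # global, attention path
        y = y + tokens.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.project(y)                      # inverted residual connection
```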